DCU-ADAPT: Learning Edit Operations for Microblog Normalisation with the Generalised Perceptron
نویسندگان
چکیده
We describe the work carried out by the DCU-ADAPT team on the Lexical Normalisation shared task at W-NUT 2015. We train a generalised perceptron to annotate noisy text with edit operations that normalise the text when executed. Features are character n-grams, recurrent neural network language model hidden layer activations, character class and eligibility for editing according to the task rules. We combine predictions from 25 models trained on subsets of the training data by selecting the most-likely normalisation according to a character language model. We compare the use of a generalised perceptron to the use of conditional random fields restricted to smaller amounts of training data due to memory constraints. Furthermore, we make a first attempt to verify Chrupała (2014)’s hypothesis that the noisy channel model would not be useful due to the limited amount of training data for the source language model, i.e. the language model on normalised text.
منابع مشابه
Comparative Evaluation of Query Expansion Methods for Enhanced Search on Microblog Data: DCU ADAPT @ SMERP 2017 Workshop Data Challenge
The rapid growth in the availability of social media content posted during emergency situations is creating significant interest in research into how this information can be exploited to assist emergency relief operations and to help with emergency preparedness and in early warning systems. We describe the DCU ADAPT Centre participation in the microblog search data challenge at the SMERP 2017 w...
متن کاملAutomatically Constructing a Normalisation Dictionary for Microblogs
Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to gen...
متن کاملCombination of Feature Selection and Learning Methods for IoT Data Fusion
In this paper, we propose five data fusion schemes for the Internet of Things (IoT) scenario,which are Relief and Perceptron (Re-P), Relief and Genetic Algorithm Particle Swarm Optimization (Re-GAPSO), Genetic Algorithm and Artificial Neural Network (GA-ANN), Rough and Perceptron (Ro-P)and Rough and GAPSO (Ro-GAPSO). All the schemes consist of four stages, including preprocessingthe data set ba...
متن کاملMelbourne Language Group Microblog Track Report
This report outlines the TREC 2011 microblog track submission of the Language Technology Group at The University of Melbourne. The microblog track is an ad–hoc retrieval task over Twitter data with temporally-specified queries, and the requirement that all results must predate the query. Our objective is to establish baseline results for the task and study the relative impact of various factors...
متن کاملImproved-Edit-Distance Kernel for Chinese Relation Extraction
In this paper, a novel kernel-based method is presented for the problem of relation extraction between named entities from Chinese texts. The kernel is defined over the original Chinese string representations around particular entities. As a kernel function, the Improved-Edit-Distance (IED) is used to calculate the similarity between two Chinese strings. By employing the Voted Perceptron and Su...
متن کامل